Introduction

The goal of this project is to create a model that predicts where in the draft an NFL prospect will be selected. I will combine multiple data sets to yield better results; the data points I'm focusing on are a player's position, the school they attended, and their combine stats. This is a multiclass classification problem that will use classification models to make its predictions.

What is the NFL Draft?

The NFL Draft is an opportunity for NFL teams to select players. Each team picks in an order determined by how well it did the previous season: the team with the worst record picks first, the team with the second-worst record picks second, and so on. Once all 32 NFL teams have picked a player, the process repeats in the same order. Each group of 32 picks is referred to as a "round," and there are 7 total rounds in the draft.
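The round structure above can be expressed as a small helper; a minimal sketch that assumes a flat 32 picks per round (real drafts add compensatory picks, so later rounds run longer).

```r
# Maps an overall pick number to its round under the simplified
# 32-picks-per-round structure described above (compensatory picks ignored).
pick_to_round <- function(overall) ceiling(overall / 32)

pick_to_round(1)    # round 1
pick_to_round(33)   # round 2, the first pick after the first 32
pick_to_round(224)  # round 7, the last pick of a flat 7-round draft
```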

Why will this be useful?

This model can be useful both for the teams drafting players and for the prospects entering the draft. For teams, it could help evaluate players, determine where they should be drafted, and, when compared against mock drafts, flag players who are under- or overvalued. For prospects, it can provide perspective on whether to enter the draft before their senior year of college. If the model predicts they'll be drafted, or projects them highly, it may be a good time to declare; if it predicts they won't be drafted, or will be drafted too low, another year of college development may be the better choice.

Exploratory Data Analysis

To find appropriate data for my project, I searched the Kaggle database and found two data sets that will be useful.

nfl_draft_prospects

Source: nfl_draft_prospects

Author: Jack Lichtenstein, Publication: May 5th 2021

Raw Data

Looking at the raw data, there are 24 columns, but some of these can be deleted since they aren't helpful. Obvious ones include: player_id, link, traded, trade_note, team, team_abbr, team_logo_espn, guid, player_image. All of these are either links that we can't use or relate to the team that drafted the player (which we aren't interested in).

ndf_head <- nfl_draft_prospects %>% head()

ndf_head %>%
  kable() %>% 
  kable_styling("striped", full_width = F) %>%
  column_spec(1:ncol(ndf_head), extra_css = "white-space: nowrap;") %>%
  row_spec(0, align = "c") %>%
  scroll_box(width = "100%")
draft_year player_id player_name position pos_abbr school school_name school_abbr link pick overall round traded trade_note team team_abbr team_logo_espn guid weight height pos_rk ovr_rk grade player_image
1967 23590 Bubba Smith Defensive End DE Michigan State Spartans MSU http://insider.espn.com/nfl/draft/player/_/id/23590 1 1 1 FALSE from New Orleans Baltimore Colts IND https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/ind.png NA NA NA NA NA NA NA
1967 23591 Clinton Jones Running Back RB Michigan State Spartans MSU http://insider.espn.com/nfl/draft/player/_/id/23591 2 2 1 FALSE from N.Y. Giants Minnesota Vikings MIN https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/min.png NA NA NA NA NA NA NA
1967 23592 Steve Spurrier Quarterback QB Florida Gators FLA http://insider.espn.com/nfl/draft/player/_/id/23592 3 3 1 FALSE from Atlanta San Francisco 49ers SF https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/sf.png NA NA NA NA NA NA NA
1967 23593 Bob Griese Quarterback QB Purdue Boilermakers PUR http://insider.espn.com/nfl/draft/player/_/id/23593 4 4 1 FALSE NA Miami Dolphins MIA https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/mia.png NA NA NA NA NA NA NA
1967 23594 George Webster Linebacker LB Michigan State Spartans MSU http://insider.espn.com/nfl/draft/player/_/id/23594 5 5 1 FALSE NA Houston Oilers TEN https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/ten.png NA NA NA NA NA NA NA
1967 23595 Floyd Little Running Back RB Syracuse Orange SYR http://insider.espn.com/nfl/draft/player/_/id/23595 6 6 1 FALSE NA Denver Broncos DEN https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/den.png NA NA NA NA NA NA NA
# Only keeping players drafted in the 2000 draft and after
prospect_2000 <- subset(nfl_draft_prospects, draft_year >= 2000)

# Visualizing the missing data
prospect_final1 <- prospect_2000
prospect_final1 %>%
  vis_miss()

Column deletion

  • weight and height: These appear in the combine stats later

  • pos_rk and ovr_rk: These are directly correlated with draft position and won't be helpful when trying to predict draft position from other predictors.

  • years 1967 - 1999: These years aren't included in the combine data set I found, and they were missing a lot of data in important columns

Column addition

  • Division and Conference: Will help us determine whether the school a player attended correlates with where they are drafted

  • Drafted: Yes or no, indicating whether a player was drafted
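The deletions and additions above might look like the following sketch; the exact rule for flagging undrafted players and the source of the Division/Conference columns are assumptions.

```r
library(dplyr)

# Drop the unusable link/team columns plus the ones removed above
drop_cols <- c("player_id", "link", "traded", "trade_note", "team",
               "team_abbr", "team_logo_espn", "guid", "player_image",
               "weight", "height", "pos_rk", "ovr_rk")

prospect_final1 <- prospect_2000 %>%
  select(-any_of(drop_cols)) %>%
  # Assumed rule: players without an overall pick number were undrafted
  mutate(Drafted = if_else(is.na(overall), "No", "Yes"))
```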

pf_head <- prospect_final1 %>% head()

pf_head %>%
  kable() %>% 
  kable_styling("striped", full_width = F) %>%
  column_spec(1:ncol(pf_head), extra_css = "white-space: nowrap;") %>%
  row_spec(0, align = "c") %>%
  scroll_box(width = "100%")
draft_year Player_Name position Pos school school_abbr Division Conference overall round grade Drafted
6180 2000 Courtney Brown Defensive End DL Penn State PSU FBS Big Ten 1 1 NA Yes
6181 2000 Lavar Arrington Linebacker LB Penn State PSU FBS Big Ten 2 1 NA Yes
6182 2000 Chris Samuels Offensive Tackle OL Alabama ALA FBS SEC 3 1 NA Yes
6183 2000 Peter Warrick Wide Receiver WR Florida State FSU FBS ACC 4 1 NA Yes
6184 2000 Jamal Lewis Running Back RB Tennessee TENN FBS SEC 5 1 NA Yes
6185 2000 Corey Simon Defensive Tackle DL Florida State FSU FBS ACC 6 1 NA Yes

combine_results

Source: combine_results

Author: Mitchell Weg, Publication: June 8th 2023

Raw Data

All 11 of these columns will be useful because they describe a player's physical attributes or give us information for merging with the nfl_draft_prospects data set. I still need to convert the heights to inches and standardize the positions and names so the two data sets can be merged.
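For the height conversion, one way to turn the "ft-in" strings (e.g. "6-4") into inches; a minimal sketch that assumes every value matches that pattern.

```r
# Convert heights recorded as "feet-inches" strings into total inches
height_to_inches <- function(ht) {
  parts <- strsplit(ht, "-", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 12 + as.numeric(p[2]), numeric(1))
}

height_to_inches(c("6-4", "5-10"))  # 76 70
```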

# Putting all of the combine results into one data set
combine_results_all <- bind_rows(
  combine_results_2000, combine_results_2001, combine_results_2002,
  combine_results_2003, combine_results_2004, combine_results_2005,
  combine_results_2006, combine_results_2007, combine_results_2008,
  combine_results_2009, combine_results_2010, combine_results_2011,
  combine_results_2012, combine_results_2013, combine_results_2014,
  combine_results_2015, combine_results_2016, combine_results_2017,
  combine_results_2018, combine_results_2019, combine_results_2020,
  combine_results_2021
)

# Making a copy
combine_results_all1 <- combine_results_all

cr_head <- combine_results_all1 %>% head()

cr_head %>%
  kable() %>% 
  kable_styling("striped", full_width = F) %>%
  column_spec(1:ncol(cr_head), extra_css = "white-space: nowrap;") %>%
  row_spec(0, align = "c") %>%
  scroll_box(width = "100%")
Player Pos School Ht Wt X40yd Vertical Bench Broad.Jump X3Cone Shuttle
John Abraham OLB South Carolina 6-4 252 4.55 NA NA NA NA NA
Shaun Alexander RB Alabama 6-0 218 4.58 NA NA NA NA NA
Darnell Alford OT Boston Col. 6-4 334 5.56 25 23 94 8.48 4.98
Kyle Allamon TE Texas Tech 6-2 253 4.97 29 NA 104 7.29 4.49
Rashard Anderson CB Jackson State 6-2 206 4.55 34 NA 123 7.18 4.15
Jake Arians K Ala-Birmingham 5-10 202 NA NA NA NA NA NA
combine_results_all1 %>%
  vis_miss()

cra_head <- combine_results_all2 %>% head()

cra_head %>%
  kable() %>% 
  kable_styling("striped", full_width = F) %>%
  column_spec(1:ncol(cra_head), extra_css = "white-space: nowrap;") %>%
  row_spec(0, align = "c") %>%
  scroll_box(width = "100%")
Player_Name Pos School weight X40yd Vertical Bench Broad.Jump X3Cone Shuttle height
John Abraham LB South Carolina 252 4.55 NA NA NA NA NA 76
Shaun Alexander RB Alabama 218 4.58 NA NA NA NA NA 72
Darnell Alford OL Boston Col. 334 5.56 25 23 94 8.48 4.98 76
Kyle Allamon TE Texas Tech 253 4.97 29 NA 104 7.29 4.49 74
Rashard Anderson DB Jackson State 206 4.55 34 NA 123 7.18 4.15 74
Jake Arians K Ala-Birmingham 202 NA NA NA NA NA NA 70

pcm_inorder

P.C.M: Prospect Combine Merge

Merging both data sets into one gives us a comprehensive list of prospects from the 2000 through 2021 drafts, along with their combine stats.
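The merge itself isn't shown above; its rough shape is sketched below, with the join keys (the standardized name and position) being an assumption.

```r
library(dplyr)

# Join the cleaned prospect table to the combine results and order by pick.
# An inner join keeps only players appearing in both data sets.
pcm_inorder <- prospect_final1 %>%
  inner_join(combine_results_all2, by = c("Player_Name", "Pos")) %>%
  arrange(draft_year, overall)
```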

pcm_head <- pcm_inorder %>% head()

pcm_head %>%
  kable() %>% 
  kable_styling("striped", full_width = F) %>%
  column_spec(1:ncol(pcm_head), extra_css = "white-space: nowrap;") %>%
  row_spec(0, align = "c") %>%
  scroll_box(width = "100%")
Player_Name overall round Drafted draft_year grade Pos position height weight X40yd Vertical Bench Broad.Jump X3Cone Shuttle school_abbr school Division Conference
1156 Courtney Brown 1 1 Yes 2000 NA DL Defensive End 77 269 4.78 NA NA NA NA NA PSU Penn State FBS Big Ten
3382 Lavar Arrington 2 1 Yes 2000 NA LB Linebacker 75 250 4.53 NA NA NA NA NA PSU Penn State FBS Big Ten
979 Chris Samuels 3 1 Yes 2000 NA OL Offensive Tackle 77 325 5.08 NA NA NA NA NA ALA Alabama FBS SEC
4127 Peter Warrick 4 1 Yes 2000 NA WR Wide Receiver 71 194 4.58 NA NA NA NA NA FSU Florida State FBS ACC
2284 Jamal Lewis 5 1 Yes 2000 NA RB Running Back 72 240 4.58 NA 23 NA NA NA TENN Tennessee FBS SEC
1127 Corey Simon 6 1 Yes 2000 NA DL Defensive Tackle 74 297 4.83 NA NA NA NA NA FSU Florida State FBS ACC

Code Book

  • Player_Name(chr): Name of the player
  • overall(int): The overall pick a player was drafted
  • round(factor): The round a player was drafted in
  • Drafted(factor): If a player was or wasn’t drafted
  • draft_year(int): The year a player was drafted
  • grade(int): ESPN's evaluation of the player, from 0 (worst) to 100 (best)
  • Pos(factor): Abbreviation of a player’s position
  • position(chr): Full length of a player’s position
  • height(int): Player's height in inches
  • weight(int): Player's weight in pounds
  • X40yd(int): Player's 40-yard dash time in seconds
  • Vertical(int): Player's vertical leap in inches
  • Bench(int): Player's bench press reps of 225 lbs
  • Broad.Jump(int): Player's broad jump in inches
  • X3Cone(int): Player's 3-cone drill time in seconds
  • Shuttle(int): Player's shuttle run time in seconds
  • school_abbr(chr): Abbreviation of a player’s college
  • school(chr): Full length of a player’s college
  • Division(factor): If a player played in the FBS or FCS division
  • Conference(factor): Which conference the player’s college was in, FCS or a conference in FBS

Missing Data

Looking at the data ordered by year, it's clear that from 2000 to 2003 ESPN simply didn't give prospects a grade. I think the best course of action is to delete these years from the data; it won't introduce bias, since each draft year is independent of the others.

pcm_inorder %>% vis_miss()
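The year filter described above is a one-liner, sketched here in the same subset() style used earlier.

```r
# Drop 2000-2003, the years where ESPN grades are entirely missing
pcm_inorder <- subset(pcm_inorder, draft_year >= 2004)
```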

Looking at the missingness by position, the only blanks that aren't missing at random are for quarterbacks, punters, and kickers: quarterbacks don't record bench at the combine, while kickers and punters only record 40-yard times. Also, the second gap in bench is due to wide receivers not recording bench from 2004 to 2006. For the quarterbacks, punters, and kickers I'm imputing 0s for the missing values. I'll use bagged trees to fill in the rest of the missing data.

pos_order %>% vis_miss()
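The zero-fill step can be sketched as a small helper (column names and position codes follow the merged table); the remaining gaps were then filled with bagged-tree imputation, e.g. recipes::step_impute_bag(), though the exact recipe used is not shown here.

```r
# Hard-code zeros for drills a position never performs at the combine:
# QBs skip the bench, while kickers and punters only run the 40.
zero_drills <- function(df) {
  df$Bench[df$Pos %in% c("QB", "K", "P") & is.na(df$Bench)] <- 0
  for (col in c("Vertical", "Bench", "Broad.Jump", "X3Cone", "Shuttle")) {
    df[[col]][df$Pos %in% c("K", "P") & is.na(df[[col]])] <- 0
  }
  df
}
```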

pcmf_head <- pcm_full %>% head()

pcmf_head %>%
  kable() %>% 
  kable_styling("striped", full_width = F) %>%
  column_spec(1:ncol(pcmf_head), extra_css = "white-space: nowrap;") %>%
  row_spec(0, align = "c") %>%
  scroll_box(width = "100%")
Player_Name overall round Drafted draft_year grade Pos position height weight X40yd Vertical Bench Broad.Jump X3Cone school_abbr school Division Conference Shuttle
Alan Reuber 0 Undrafted No 2004 67 OL Offensive Guard 78 323 5.49 29.0 26 98 7.95 TA&M Texas A&M FBS SEC 4.91
Andrae Thurman 0 Undrafted No 2004 48 WR Wide Receiver 71 192 4.54 34.5 15 121 7.31 SOU Southern Oregon FCS FCS 4.30
Andrew Shull 0 Undrafted No 2004 59 DL Defensive End 77 265 4.90 30.5 16 107 7.46 KSU Kansas State FBS Big 12 4.28
Anthony Herrera 0 Undrafted No 2004 60 OL Offensive Guard 74 315 5.20 28.5 26 104 7.76 TENN Tennessee FBS SEC 4.71
Antonio Hall 0 Undrafted No 2004 43 OL Offensive Tackle 75 317 5.54 26.5 27 101 8.12 UK Kentucky FBS SEC 4.55
Arnold Parker 0 Undrafted No 2004 58 DB Safety 74 213 4.54 35.5 18 120 6.98 NA NA FCS FCS 4.12
pcm_full %>% vis_miss()

We have no more missingness!

We can now learn more about our data through visualization… Graphs!

Graphs

Which position is drafted the most?

Defensive Back is the highest because DB is really two positions in one: Cornerback and Safety. They make up the largest share of the defense, so it makes sense that they are drafted the most. Something I didn't expect was Wide Receiver ranking higher than Running Back; I would have predicted RBs to be higher because on average they have the shortest careers, approximately 2.57 years. But WRs see more use on the field on average, usually 3 WRs to 1 RB on a given play, so it makes sense they are drafted more often.

pcm_drafted <- subset(pcm_full, Drafted == "Yes")

ggplot(pcm_drafted, aes(x = fct_infreq(Pos))) +
  geom_bar(fill = 'navy') +
  labs(title = "Prospects Drafted by Position", x = "Position", y = "Players Drafted") +
  theme_minimal()

Which position is most commonly drafted in the first round? First overall pick?

What I notice in this chart is that offensive linemen, defensive linemen, and quarterbacks are picked more often in the first round. O-linemen are the 5th most drafted position overall but the 3rd most in the first round, while QBs are 9th overall but 6th in the first round. Also, D-linemen are picked the most in the first round and second most as the first overall pick. This tells me that DL, OL, and QB are the most valuable positions in football, since they are picked before everyone else.

pcm_full_r1 <- subset(pcm_full, round == "1")

pcm_full_r11 <- subset(pcm_full, round == "1" & overall == "1")

pos_r1 <- ggplot(pcm_full_r1, aes(x = fct_infreq(Pos))) +
  geom_bar(fill = 'navy') + 
  labs(title = "Prospects Drafted in the First Round", x = "Position", y = "Prospects") +
  theme_minimal()

pos_r11 <- ggplot(pcm_full_r11, aes(x = fct_infreq(Pos))) +
  geom_bar(fill = 'navy') + 
  labs(title = "Prospects Drafted First Overall", x = "Position", y = "Prospects") +
  theme_minimal()

grid.arrange(pos_r1, pos_r11, ncol = 2)

Does position have an effect on draft round?

QBs are most often picked in the first round and take a sharp drop in the second. RBs are picked most in the 4th round and take a steep drop in round five, likely because round five is when most kickers are selected.

pcm_QB <- subset(pcm_full, Pos == "QB" & Drafted == "Yes")
QB_pct <- as.data.frame(prop.table(table(pcm_QB$round)) * 100)

pcm_RB <- subset(pcm_full, Pos == "RB" & Drafted == "Yes")
RB_pct <- as.data.frame(prop.table(table(pcm_RB$round)) * 100)

pcm_K <- subset(pcm_full, Pos == "K" & Drafted == "Yes")
K_pct <- as.data.frame(prop.table(table(pcm_K$round)) * 100)

QB_chart <- ggplot(QB_pct, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", fill = 'navy') +
  ylim(c(0,40)) +
  labs(title = "Quarter Backs", x = "Rounds", y = "Percentage") +
  theme_minimal()

RB_chart <- ggplot(RB_pct, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", fill = 'navy') +
  ylim(c(0,40)) +
  labs(title = "Running Backs", x = "Rounds", y = "Percentage") +
  theme_minimal()

K_chart <- ggplot(K_pct, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", fill = 'navy') +
  ylim(c(0,40)) +
  labs(title = "Kickers", x = "Rounds", y = "Percentage") +
  theme_minimal()

grid.arrange(K_chart, RB_chart, QB_chart, ncol = 3, top = "Percentage Picked by Round")

Which conferences have been picked the most?

The SEC is by far the most popular conference for NFL teams to draft from; schools in this conference include Alabama, Georgia, LSU, Tennessee, Texas A&M, and other elite programs.

  ggplot(pcm_drafted, aes(x = fct_infreq(Conference))) + 
    geom_bar(fill = 'navy') +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = "Prospects Drafted by Conference", x = "Conference", y = "Players Drafted")

Which conferences have been picked the most in the first round?

None of the top five conferences change position, but this graph shows the tremendous gap between the five power conferences and everyone else.

  ggplot(pcm_full_r1, aes(x = fct_infreq(Conference))) + 
    geom_bar(fill = 'navy') + 
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = "Prospects Drafted by Conference in the First Round", x = "Conference", y = "Players Drafted")

Which of my continuous predictors are correlated with each other?

Overall pick seems to be the only numerical variable strongly correlated with ESPN grade. Other strongly correlated pairs include X3Cone:Shuttle (both involve short-burst sprints), Vertical:Broad Jump (jumping vertically and horizontally, so it makes sense they're correlated), Weight:X40yd (the heavier you are, the slower you are, and vice versa), Vertical:X40yd negatively (a low 40 time equates to a high vertical jump), and Weight:Bench (the heavier you are, the stronger you are, and vice versa).

pcm_numeric <- pcm_drafted %>%
  select(grade, draft_year, overall, height, weight, x40yd = X40yd,
         Vert = Vertical, Bench, Broad_Jump = Broad.Jump,
         x3Cone = X3Cone, Shuttle)
pcm_cor <- cor(pcm_numeric)
corrplot(pcm_cor, method = 'color', type='lower')

Have DL and OL prospects gotten faster as the years have gone on?

They don't seem to be getting faster every year, but it's obvious that DL are faster than OL. There also appears to be a standard for DL to run around a 5-second 40-yard dash, since every year they hover around that time.

pcm_OL_DL <- subset(pcm_full, Pos == "OL" | Pos == "DL", 
                    select = c(draft_year, Pos, X40yd))
pcm_OL_DL <- aggregate(X40yd ~ Pos + draft_year, data = pcm_OL_DL, FUN = mean)

ggplot(pcm_OL_DL, aes(fill = Pos, x = draft_year, y = X40yd)) +
  geom_bar(position = "dodge", stat = "identity") +
  ylim(c(0, 6)) + 
  labs(x = 'Draft Year', y = '40 Yard Time') + 
  theme_minimal()

Are players getting faster/stronger?

NFL prospects have gotten significantly faster since 2015, but oddly, bench press numbers have also dropped significantly since then. This may be due to stricter rules for the bench press and looser rules for the 40-yard dash in 2016. Or perhaps more lighter-weight positions like DB and WR were invited to the combine in place of heavier players like O- and D-linemen, lowering the average bench count and the average 40 time alike.

pcm_40_avg <- aggregate(X40yd ~ draft_year, pcm_full, mean)
# Getting rid of QBs Punters and Kickers from bench since they are all zero
pcm_Bench_avg <- subset(pcm_full, !(Pos %in% c("QB", "K", "P")))
pcm_Bench_avg <- aggregate(Bench ~ draft_year, pcm_Bench_avg, mean)


x40_avg_line <- ggplot(pcm_40_avg, aes(x = draft_year, y = X40yd)) +
  geom_line() +
  labs(y = "40yd Time", x = "Year", title = "40yd Time") +
  theme_minimal()

Bench_avg_line <- ggplot(pcm_Bench_avg, aes(x = draft_year, y = Bench)) +
  geom_line() +
  labs(x = "Year", y = "Bench Reps", title = "Bench Press") +
  theme_minimal()

grid.arrange(x40_avg_line, Bench_avg_line, ncol = 2, top = "Average Results per Year")

Model Set Up

Finally, we can start creating our models.

Making Groupings for Round

A model that tries to predict 8 classes (rounds 1 through 7, plus undrafted) wouldn't be effective; that's too many for the models to separate well, so I'm instead going to group the rounds into 4 classes.

pcm_full$round <- gsub("\\b(1|2)\\b", "1st or 2nd", pcm_full$round, ignore.case = TRUE)
pcm_full$round <- gsub("\\b(3|4)\\b", "3rd or 4th", pcm_full$round, ignore.case = TRUE)
pcm_full$round <- gsub("\\b(5|6)\\b", "5th or 6th", pcm_full$round, ignore.case = TRUE)
pcm_full$round <- gsub("\\b(7|Undrafted)\\b", "7th or UD", pcm_full$round, ignore.case = TRUE)
  • 1st or 2nd: First or Second Round Draft Pick (High)
  • 3rd or 4th: Third or Fourth Round Draft Pick (Middle High)
  • 5th or 6th: Fifth or Sixth Round Draft Pick (Middle Low)
  • 7th or UD: Seventh or Undrafted (Low)

Final Cleaning

# Cleaning the variable names
pcm_full_c <- clean_names(pcm_full)

# Changing necessary character variables to factors
pcm_full_c$round <- as.factor(pcm_full_c$round)
pcm_full_c$drafted <- as.factor(pcm_full_c$drafted)
pcm_full_c$pos <- as.factor(pcm_full_c$pos)
pcm_full_c$conference <- as.factor(pcm_full_c$conference)

# Changing all dbl to int
pcm_full_c <- pcm_full_c %>% mutate_if(is.double, as.integer)

Splitting Data

set.seed(1936)

# Splitting the data by 80% and stratifying by round
nfl_split <- initial_split(pcm_full_c, prop = 0.80, strata = round)

nfl_train <- training(nfl_split)
nfl_test <- testing(nfl_split)

dim(nfl_train)  
## [1] 3703   20
dim(nfl_test)
## [1] 927  20

3,703 observations for the training data and 927 observations for the testing data

K-fold Cross Validation

Is the data imbalanced enough where stratified sampling for cross-validation is necessary?

ggplot(pcm_full_c, aes(x = fct_infreq(round))) +
  geom_bar(fill = 'navy') +
  labs(x = 'Classes', y = "Players Drafted") +
  theme_minimal()

Looking at the graph, there is a significant enough imbalance between "7th or UD" and the other classes that stratified sampling is warranted.

nfl_fold <- vfold_cv(nfl_train, v = 5, strata = round)

Recipe Building

For my predictors I’m going to use:

grade, pos, height, weight, x40yd, vertical, bench, broad_jump, x3cone, shuttle, and conference

nfl_recipe <- recipe(round ~ grade + pos + height + weight + x40yd + vertical + bench + 
                       broad_jump + x3cone + shuttle + conference, data = nfl_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

prep(nfl_recipe) %>%
  bake(new_data = nfl_train) %>%
  head() %>%
  kable() %>% 
  kable_styling("striped", full_width = F) %>%
  column_spec(1:32, extra_css = "white-space: nowrap;") %>%
  row_spec(0, align = "c") %>%
  scroll_box(width = "100%")
grade height weight x40yd vertical bench broad_jump x3cone shuttle round pos_DL pos_FB pos_K pos_LB pos_LS pos_OL pos_P pos_QB pos_RB pos_TE pos_WR conference_American conference_Big.12 conference_Big.Ten conference_Conference.USA conference_FBS.Independents conference_FCS conference_Mid.American conference_Mountain.West conference_Pac.12 conference_SEC conference_Sun.Belt
1.875020 1.1804448 -0.4748162 -0.5261406 -0.0390914 -2.4138263 0.1481131 0.2786202 0.1382879 1st or 2nd -0.4366048 -0.0918694 -0.0903632 -0.370941 -0.0435135 -0.421161 -0.1004491 4.1735094 -0.3038181 -0.2494638 -0.4054886 -0.1665748 -0.3528072 -0.4176002 -0.1193266 -0.1691254 -0.3163729 -0.1407973 -0.211784 -0.3877069 1.9562380 -0.1304749
1.927180 1.9259995 1.7791165 -0.5261406 -0.3796929 0.6349141 -0.4734706 0.2786202 0.1382879 1st or 2nd -0.4366048 -0.0918694 -0.0903632 -0.370941 -0.0435135 2.373748 -0.1004491 -0.2395418 -0.3038181 -0.2494638 -0.4054886 -0.1665748 -0.3528072 2.3939881 -0.1193266 -0.1691254 -0.3163729 -0.1407973 -0.211784 -0.3877069 -0.5110472 -0.1304749
1.927180 0.4348901 -0.3864267 -0.5261406 0.4718107 -0.1272710 0.4871587 0.2786202 0.1382879 1st or 2nd -0.4366048 -0.0918694 -0.0903632 -0.370941 -0.0435135 -0.421161 -0.1004491 -0.2395418 -0.3038181 -0.2494638 2.4654947 -0.1665748 -0.3528072 -0.4176002 -0.1193266 -0.1691254 -0.3163729 -0.1407973 -0.211784 -0.3877069 -0.5110472 -0.1304749
1.718537 1.1804448 -0.2980371 1.8941063 -0.0390914 -2.4138263 0.1481131 0.2786202 0.1382879 1st or 2nd -0.4366048 -0.0918694 -0.0903632 -0.370941 -0.0435135 -0.421161 -0.1004491 4.1735094 -0.3038181 -0.2494638 -0.4054886 -0.1665748 -0.3528072 -0.4176002 -0.1193266 -0.1691254 -0.3163729 -0.1407973 -0.211784 -0.3877069 -0.5110472 -0.1304749
1.927180 0.0621128 -0.2759398 -0.5261406 0.4718107 -0.1272710 0.4871587 0.2786202 0.1382879 1st or 2nd -0.4366048 -0.0918694 -0.0903632 -0.370941 -0.0435135 -0.421161 -0.1004491 -0.2395418 -0.3038181 -0.2494638 -0.4054886 -0.1665748 -0.3528072 -0.4176002 -0.1193266 -0.1691254 -0.3163729 -0.1407973 -0.211784 -0.3877069 -0.5110472 -0.1304749
1.875020 0.0621128 -0.6736926 -0.5261406 0.4718107 -0.3813327 0.4871587 0.2786202 0.1382879 1st or 2nd -0.4366048 -0.0918694 -0.0903632 -0.370941 -0.0435135 -0.421161 -0.1004491 -0.2395418 -0.3038181 -0.2494638 2.4654947 -0.1665748 2.8336440 -0.4176002 -0.1193266 -0.1691254 -0.3163729 -0.1407973 -0.211784 -0.3877069 -0.5110472 -0.1304749

Model Building

I have chosen five models that I believe will fit my data best: Elastic Net Regression, Gradient Boosted Trees, K-Nearest Neighbors, Linear Discriminant Analysis (LDA), and Random Forest. I will train each model on the training data, tuning parameters where necessary, then measure each model's performance by its roc_auc value. Finally, I will fit the best-performing model to the testing data.
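Each model followed the same tidymodels pattern: a parsnip specification with tune() placeholders, a workflow bundling it with the recipe, and tune_grid() over the folds. A sketch for the elastic net is below; the grid sizes and ranges are placeholders, not the exact values used.

```r
library(tidymodels)

# Multinomial elastic net: tune penalty (regularization strength)
# and mixture (lasso/ridge balance)
net_spec <- multinom_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

net_wf <- workflow() %>%
  add_recipe(nfl_recipe) %>%
  add_model(net_spec)

net_tune_nfl <- tune_grid(
  net_wf,
  resamples = nfl_fold,
  grid = grid_regular(penalty(), mixture(), levels = 10),
  metrics = metric_set(roc_auc)
)
```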

save(net_tune_nfl, file = "/Users/brend/OneDrive/Documents/131 Project/net tune.rda")
save(knn_tune_nfl, file = "/Users/brend/OneDrive/Documents/131 Project/knn tune.rda")
save(rf_tune_nfl, file = "/Users/brend/OneDrive/Documents/131 Project/rf tune 2.rda")
save(bt_tune_nfl, file = "/Users/brend/OneDrive/Documents/131 Project/bt tune.rda")
# Loading in saved tuning
load("/Users/brend/OneDrive/Documents/131 Project/net tune.rda")
load("/Users/brend/OneDrive/Documents/131 Project/knn tune.rda")
load("/Users/brend/OneDrive/Documents/131 Project/rf tune 2.rda")
load("/Users/brend/OneDrive/Documents/131 Project/bt tune.rda")

Elastic Net Regression

For the elastic net model, the parameters penalty and mixture were tuned. Penalty controls how strongly the coefficients are regularized, and mixture controls the balance between ridge regression and lasso regression. Looking at the roc_auc plot, the model performs best with a small penalty value and a mixture near 1 (mostly lasso), which matches the best combination selected later.

autoplot(net_tune_nfl, metric = 'roc_auc')

K-Nearest Neighbors

The only parameter that needs to be tuned for k-nearest neighbors is the number of neighbors considered. The graph shows that performance improves as the number of neighbors increases, but the model still doesn't perform as well as the others we tried.

autoplot(knn_tune_nfl, metric = 'roc_auc')

Random Forest

In a random forest there are three parameters that need to be tuned.

  • mtry: number of predictors that will be randomly sampled at each split of the tree model

  • trees: number of trees in the model

  • min_n: minimum number of data points in a node required for the node to be split further

The graphs tell me the data fits best with a high mtry, a medium number of trees, and a high min_n value.
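The corresponding specification might look like this sketch (the ranger engine is an assumption).

```r
library(tidymodels)

# Random forest with all three parameters marked for tuning
rf_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
```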

autoplot(rf_tune_nfl, metric = 'roc_auc')

Gradient Boosted Trees

This model tunes mtry and trees like the random forest, but adds learn_rate, which controls how much weight each tree has on the overall model.

Looking at the performance, a low learn rate, few predictors, and few trees seemed to fit the data best.
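A sketch of the boosted tree specification (the xgboost engine is an assumption).

```r
library(tidymodels)

# Gradient boosted trees: learn_rate shrinks each tree's contribution
bt_spec <- boost_tree(mtry = tune(), trees = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
```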

autoplot(bt_tune_nfl, metric = 'roc_auc')

Model ROC_AUC Rates

Even though the LDA model performed the best, I'm more confident in the elastic net model, so that is what I will use to fit my testing data.

roc_results %>%
  kable() %>% 
  kable_styling("striped", full_width = T)
Models .metric mean
LDA roc_auc 0.8424373
Elastic Net roc_auc 0.8353271
Random Forest roc_auc 0.8316450
Boosted Tree roc_auc 0.8164595
K-Nearest Neighbors roc_auc 0.7585771

Results

Chosen Parameters

The best parameters for the Elastic Net model are:

best_net_nfl %>% dplyr::select(.config, penalty, mixture) %>%
  kable() %>% 
  kable_styling(full_width = T)
.config penalty mixture
Preprocessor1_Model097 0.0004642 1

Fitting to Test Data

After finalizing the parameters and applying the final fitted model to the testing data, we get an roc_auc of approximately 0.83.
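The finalization step that produced these predictions isn't shown above; one plausible shape is sketched below, where `net_wf` stands in for the tuned elastic net workflow (a hypothetical name).

```r
# Lock in the best penalty/mixture, refit on the full training set,
# then attach class probabilities and predictions to the test set
final_net_wf <- finalize_workflow(net_wf, best_net_nfl)
final_net_nfl <- fit(final_net_wf, data = nfl_train)
final_net_nfl_test <- augment(final_net_nfl, new_data = nfl_test)
```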

roc_auc(final_net_nfl_test, truth = round, '.pred_1st or 2nd':'.pred_7th or UD') %>%
  kable() %>% 
  kable_styling(full_width = T)
.metric .estimator .estimate
roc_auc hand_till 0.8326798

This is an okay estimate; it's about what we expected based on our tuning.

Most Important Predictors

Grade was by far the most important predictor for determining where a player is drafted. This makes intuitive sense: the more highly graded (better) a player is, the sooner they'll likely be drafted. Broad jump and weight look like the most important physical attributes for draft position. Playing kicker, defensive line, or tight end was also an indicator of draft spot. I'm surprised and slightly disappointed that the conference a player played in had little to no effect on the outcome (it took a long time to clean that data).

final_net_nfl %>% extract_fit_parsnip() %>%
  vip() 

ROC Curve per Class

It appears it's much easier to predict first- and second-round selections than the other classes. This makes sense because there is a massive talent gap between the first two rounds and the rest of the draft.

roc_curve(final_net_nfl_test, truth = round, '.pred_1st or 2nd':'.pred_7th or UD') %>%
  autoplot()

Confusion Matrix

The gap between the first and second rounds and everyone else is apparent: the model rarely predicted a highly rated prospect incorrectly. However, it was especially bad at predicting 5th- and 6th-round players, mostly categorizing them as 7th round or undrafted. This could be because those classes have too much in common, making them hard to distinguish. Having four classes also makes correct predictions harder; a model that predicts only two outcomes, such as drafted or not drafted, would do better.

conf_mat(final_net_nfl_test, truth = round, .pred_class) %>%
  autoplot(type = "heatmap")

Conclusion

I've learned that it is very challenging to predict where players will be drafted. Many more factors go into drafting a player than the ones I used: the positions the drafting team needs to fill, whether the player has off-field issues, the player's age, whether they are injury-prone, and several other variables. A player's combine stats and conference simply weren't enough. My model also relied heavily on ESPN's grading system, which is disappointing; once the grades became lower and more inconsistent, the model had a harder time predicting players. The multiclass setup was also challenging: the more classes you try to predict, the harder it is for the model to make correct predictions. This model can definitely be improved with better predictors, more data, and fewer classes.